Abstract AutoTutor helps students learn by holding a conversation in natural language.
AutoTutor is adaptive to the learners’ actions, verbal contributions, and in some
systems their emotions. Many of AutoTutor’s conversation patterns simulate human 
tutoring, but other patterns implement ideal pedagogies that open the door to computer 
tutors eclipsing human tutors in learning gains. Indeed, current versions of AutoTutor 
yield learning gains on par with novice and expert human tutors. This article selectively 
highlights the status of AutoTutor’s dialogue moves, learning gains, implementation 
challenges, differences between human and ideal tutors, and some of the systems that 
evolved from AutoTutor. Current and future AutoTutor projects are investigating three-party
conversations, called trialogues, where two agents (such as a tutor and student)
interact with the human learner. 
The Tutorial Dialogue of AutoTutor


This article reflects on a paper published in 2001 on “Teaching Tactics and Dialogue in 
AutoTutor”, coauthored by Graesser et al. (2001). AutoTutor is a pedagogical agent that 
holds a conversation with students in natural language and simulates the dialogue 
moves of human tutors as well as ideal pedagogical strategies (Graesser et al. 2004; 
Graesser et al. 2008; see Nye et al. 2014, for an in-depth history of 17 years of
AutoTutor). My colleagues and I were inspired by the notion that there is something 
about conversation mechanisms in a tutoring session that helps people learn (Graesser
et al. 1995). And indeed, untrained tutors do help students learn better than classroom 


Submission to Special Issue of International Journal of Artificial Intelligence in Education on 25th 
Anniversary. 


Arthur C. Graesser
graesser@memphis.edu 


Department of Psychology & Institute for Intelligent Systems, University of Memphis, 202 
Psychology Building, Memphis, TN 38152-3230, USA 




Int J Artif Intell Educ (2016) 26:124-132 125 


interactions and various other ecological controls (Graesser et al. 2011). We were also 
inspired by the notion that some of the discourse moves of tutors could be improved if 
they were guided by ideal pedagogical principles. So a combination of natural dis- 
course interaction and ideal tutor moves would be the magic formula to improve 
student learning. 

We wrestled with the possibility that ideal computer tutoring moves may sometimes
differ from the normal conversational moves of human tutors. For example, our
analyses of human tutors revealed that they are prone to follow principles of conver- 
sational politeness so they are reluctant to give negative feedback when a student’s 
contribution is incorrect or vague (Graesser et al. 1995; Person et al. 1995). Accurate 
feedback sometimes needs to be sacrificed in order to promote confidence and self- 
efficacy in the student (Lepper and Woolverton 2002). However, many students expect 
the computers to be accurate rather than polite. Consequently, there is a trade-off 
between feedback accuracy and the promotion of politeness or self-esteem. There also 
appeared to be illuminating differences in the pragmatic ground rules of communica- 
tion with computers versus humans (Person et al. 1995). Given the various trade-offs 
and incompatible predictions, we envisioned a program of research to investigate the 
impact of specific tutoring strategies and conversation patterns on student learning and 
motivation. This program of research continues to evolve among colleagues investi- 
gating automated tutorial dialogue both in Memphis (Nye et al. 2014; Rus et al. 2013) 
and other labs (e.g., Dzikovska et al. 2014; Johnson and Lester 2015; Ward et al. 2013). 

The AutoTutor project was launched in 1997 at a point in history when animated 
conversational agents emerged and penetrated learning environments. The agents were 
computerized talking heads or embodied animated avatars that generate speech, ac- 
tions, facial expressions, and gestures. Some of the agents were very rigid and scripted, 
whereas AutoTutor attempted to adapt to the knowledge states, verbosity, and the 
emotional states of the learner. AutoTutor was indeed successful in tracking the 
student’s knowledge states and adaptively generating dialogue moves (Graesser et al. 
2004; Jackson and Graesser 2006; Nye et al. 2014; VanLehn et al. 2007). We also 
developed an affect-sensitive AutoTutor that responded intelligently to the emotions of 
the student, such as confusion, frustration, and boredom (D’Mello and Graesser 2012). 
The power of conversational agents is that designers can precisely specify what the agent
expresses and does under specific conditions, whereas humans could never exhibit such 
precision. Agents can guide the learner on what to do next, deliver didactic instruction, 
hold collaborative conversations, and model ideal behavior, strategies, reflections, and 
social interactions. Pedagogical agents have become increasingly popular in contem- 
porary adaptive learning environments (DeepTutor, Rus et al. 2013; Betty’s Brain, 
Biswas et al. 2010; iSTART, McNamara et al. 2006; Crystal Island, Rowe et al. 
2010; Guru Tutor, Olney et al. 2012; Operation ARIES, Millis et al. 2011), just to 
name a few systems. These systems have covered topics in STEM (physics, biology, 
computer literacy), reading comprehension, scientific reasoning, and other domains and 
skills. 

AutoTutor and these other systems with pedagogical agents have helped students 
learn compared to various control conditions. In the case of AutoTutor, reports covering
multiple studies have found average learning gains that vary between 0.3 sigma (Nye
et al. 2014) and 0.8 sigma (Graesser et al. 2008) when compared to reading text for an
equivalent amount of time; the effect sizes are substantially higher in comparisons with 
pre-tests and no-study controls (Graesser et al. 2004; VanLehn et al. 2007). Human
tutors have not differed greatly from AutoTutor and other intelligent tutoring systems (ITSs) with natural language
interaction in experiments that provide direct comparisons with trained human tutors 
(Olney et al. 2012; VanLehn 2011; VanLehn et al. 2007). For example, in a direct 
comparison between AutoTutor and 1-to-1 human tutoring with experienced tutors in 
computer-mediated conversations (either typed or spoken), the learning gains were 
virtually equivalent on the topic of Newtonian physics (VanLehn et al. 2007). Given 
these encouraging results from human and computer tutoring, we investigated what it is 
about conversation that helps student learning and motivation. 


Conversation Patterns in AutoTutor and Human Tutors 


We conducted a series of experiments that attempted to identify the features of 
AutoTutor that might account for improvements in learning (Graesser et al. 2004, 
2008; Kopp et al. 2012; VanLehn et al. 2007). It is beyond the scope of this article to 
cover all of these features, but a few are particularly noteworthy. One noteworthy finding 
is that it is not the talking head that accounts for most of the improvement, but rather the 
content of what the agent says and the student says. The talking head has only a small 
advantage over the agent’s conveying its dialogue moves in print or spoken modalities. 
Learning from AutoTutor is not appreciably different from conditions where the learner 
is guided to read small snippets of text or summaries of a solution at opportunistic points 
in time. From the standpoint of student input modality, learning is no different when 
students express their contributions in speech or keyboard (D’Mello et al. 2011). Simply 
put, it is the content that matters: What gets expressed at the right time in a conversation? 

Another noteworthy conclusion concerns the impressive robustness of the
core conversation mechanisms in both AutoTutor and most human tutoring. As mentioned 
earlier, many of the core conversation mechanisms in AutoTutor are similar to human 
tutoring. We documented major conversation mechanisms of human tutors who tutored 
middle school children in mathematics and college students in research methods. The 
detailed anatomy of human tutoring was based on nearly 100 tutoring sessions that were
videotaped, transcribed, and analyzed in depth (Graesser et al. 1997; Graesser and Person 
1994; Graesser et al. 1995; Person et al. 1994; Person et al. 1995). In particular, one 
discourse mechanism in both AutoTutor and human tutoring is called expectation & 
misconception-tailored dialogue (EMT dialogue). The human tutors anticipate particular 
correct answers (called expectations) and particular misunderstandings (misconceptions) 
when they ask the students challenging questions (or problems) and track the students’ 
answers. As the students express their answers, which are distributed over multiple 
conversational turns, their contributions are compared with expectations and misconcep- 
tions through semantic pattern matching. The tutors give feedback to the students’ answers 
with respect to matching the expectations or misconceptions. Some feedback is short, 
consisting of positive, neutral, or negative expressions in words, intonation, or facial
expressions. After the short feedback, the tutor tries to lead the student to express the 
expectations (good answers) through multiple dialogue moves, such as pumps (“What
else?”), hints, or prompts to get the students to express specific words. When the student
fails to answer the question correctly, the tutor contributes information as assertions. The 
pump-hint-prompt-assertion cycles are implemented in AutoTutor (and are frequent in 
human tutoring, Graesser et al. 1995) to extract or cover particular sentence-like expectations.
Eventually, all of the expectations are covered and the exchange is finished for the
main question or problem. 
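
As a minimal sketch of the pump-hint-prompt-assertion cycle described above, the following Python fragment escalates through the four dialogue moves for a single expectation. The names Expectation, next_move, and run_cycle, the fixed escalation of one move per failed attempt, and the is_match callback are illustrative assumptions; they are not taken from the AutoTutor implementation.

```python
from dataclasses import dataclass


@dataclass
class Expectation:
    text: str             # the sentence-like good answer
    hint: str             # a broad cue toward the expectation
    prompt: str           # a question that elicits a specific word
    covered: bool = False


def next_move(expectation, attempts):
    """Choose the tutor move after `attempts` failed tries (assumed order)."""
    if attempts == 0:
        return ("pump", "What else?")
    if attempts == 1:
        return ("hint", expectation.hint)
    if attempts == 2:
        return ("prompt", expectation.prompt)
    return ("assertion", expectation.text)


def run_cycle(expectation, student_replies, is_match):
    """Escalate pump -> hint -> prompt -> assertion until covered.

    student_replies: the student's replies, in order
    is_match: callable(student_text, expectation_text) -> bool
    Returns the names of the tutor moves that were used.
    """
    moves = []
    said_so_far = ""
    replies = iter(student_replies)
    for attempts in range(4):
        kind, _text = next_move(expectation, attempts)
        moves.append(kind)
        if kind == "assertion":
            # The tutor states the answer, which covers the expectation.
            expectation.covered = True
            break
        # Pool the student's contributions across turns before matching.
        said_so_far = (said_so_far + " " + next(replies, "")).strip()
        if is_match(said_so_far, expectation.text):
            expectation.covered = True
            break
    return moves
```

In this sketch the student's replies are pooled across turns before matching, mirroring the observation that answers are distributed over multiple conversational turns. A real system would substitute a semantic matcher for is_match.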

It is feasible to implement EMT dialogue computationally because it relies on 
semantic pattern matching and attempts to achieve pattern completion (through hints 
and prompts). This is a simpler mechanism than interpreting natural language from 
scratch, which is beyond the boundaries of reliable natural language processing. EMT 
dialogue is not only frequent in human tutoring but creates reasonably smooth conver- 
sations in AutoTutor and helps students learn. Interestingly, human tutors rarely use 
sophisticated tutoring strategies that are difficult to implement on a computer, such as
bona fide Socratic tutoring, modeling-scaffolding-fading, building on prerequisites, and 
dialogue moves that scaffold metacognitive strategies (Cade et al. 2008; Graesser et al. 
1995). Automated computer tutors will possibly show major advantages over human 
tutors when the systems can reliably implement these more sophisticated strategies. 

AutoTutor successfully implemented nearly all of the conversational mechanisms of 
human tutors, but one notable exception is that it could not handle most of the student 
questions. Student questions are infrequent in most classroom and tutoring environ- 
ments because the teacher or tutor tends to control the agenda (Graesser et al. 1995). 
However, when students do ask questions, the relevance and correctness of the answers 
are disappointing in AutoTutor, as they are in other automated environments. We have had to
implement diversionary tactics to handle the students’ questions, such as “How would
you answer that question?” or “AutoTutor cannot answer that question now.” As a 
consequence, the frequency of student questions unfortunately extinguishes quickly in 
tutoring sessions with AutoTutor (Graesser and McNamara 2010). 

We continued to question the use of human tutors, even expert tutors, as the gold standard 
in the design of AutoTutor. We identified a number of blind spots and questionable tactics of 
human tutors (Graesser et al. 2011) that could potentially be improved by incorporating ideal 
tutoring strategies. For example, tutors are prone to give a summary recap of a solution to a 
problem, or answer to a difficult question, that required many conversational turns. It would 
be better to sometimes have the student give the summary recap in order to promote active 
student learning, to encourage the student to practice articulating the information, or to allow 
the tutor to diagnose remaining deficits. As another example, tutors often assume that the 
student understands what the tutor expresses in an exchange whereas students often do not 
understand, even partially. Indeed, there often is a large gulf between the knowledge of the 
student and that of the tutor. It sometimes would be better for the tutor to ask follow-up
questions to verify the extent to which the student understands what the tutor is attempting to 
communicate. Ideal tutoring strategies are needed to augment or replace some of the typical 
conversation patterns in human tutoring. 

One of the pervasive challenges throughout the development of AutoTutor and 
subsequent learning environments has been optimizing the semantic match scores 
between the students’ verbal contributions and AutoTutor’s anticipated answers (both 
the expectations and misconceptions). The student’s contributions over dozens of 
conversational turns in a single dialogue are constantly compared semantically with 
the set of expectations and misconceptions. There is a speech act classifier that 
segments the student’s verbal input within a turn into speech acts and assigns each 
speech act to a category, such as question, statement, meta-cognitive expression (e.g., “I
do not know”), or short response, as designated in the Dialogue Advancer Network
(Graesser et al. 2001). The statements are the only speech acts that are compared with
the expectations and misconceptions through semantic matching algorithms. An ex- 
pectation (or misconception) is considered covered if it meets or exceeds some 
threshold parameter for matching. 
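
The routing just described can be illustrated with a small Python sketch: a turn is segmented into speech acts, each act is classified, and only the statements are compared with a target expectation against a threshold. The classification rules, the toy word-overlap similarity, and the threshold of 0.7 are hypothetical stand-ins chosen for illustration; they are not the classifier, the semantic matcher, or the parameter values used in AutoTutor.

```python
import re


def segment_turn(turn):
    """Split a student turn into speech acts at sentence-final punctuation."""
    return [s.strip() for s in re.findall(r"[^.?!]+[.?!]?", turn) if s.strip()]


def classify_speech_act(segment):
    """Toy rule-based classifier (hypothetical rules)."""
    s = segment.strip().lower()
    if s.endswith("?"):
        return "question"
    if re.search(r"\b(i do not know|i don't know|no idea)\b", s):
        return "metacognitive"
    if len(s.split()) <= 2:
        return "short_response"
    return "statement"


def tokens(text):
    """Lowercase word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def toy_similarity(student_text, target):
    """Fraction of the target's distinct words that the student expressed."""
    target_words = set(tokens(target))
    if not target_words:
        return 0.0
    return len(target_words & set(tokens(student_text))) / len(target_words)


def is_covered(turns, target, threshold=0.7):
    """Compare only the statements, pooled over all turns, with the target."""
    statements = [seg for turn in turns for seg in segment_turn(turn)
                  if classify_speech_act(seg) == "statement"]
    return toy_similarity(" ".join(statements), target) >= threshold
```

For example, with the turns "I don't know. Is it gravity?" and "The force pulls the ball down", only the last segment is treated as a statement, so the word "gravity" in the question does not count toward covering the target "gravity pulls the ball down".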

We have evaluated many semantic matchers over the years. The best results are a 
combination of latent semantic analysis (LSA) (Landauer et al. 2007), frequency 
weighted word overlap (rarer words and negations have higher weight), and 
regular expressions. In fact, LSA plus regular expressions have shown high
reliability scores when their agreement with human experts is compared with the
agreement between pairs of human experts (Cai et al. 2011). Interestingly, syntactic computations did not prove
useful in these analyses because a high percentage of the students’ contributions 
are telegraphic, elliptical, and ungrammatical. Researchers who have developed 
tutorial dialogue systems with deep syntactic parsers (e.g., BEETLE II,
Dzikovska et al. 2014) routinely point out the limitations of syntactic parsers
when students’ language contributions are of low quality.
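
Two of the three components named above, frequency weighted word overlap and regular expressions, can be sketched in a few lines of Python. LSA is omitted because it requires a trained semantic space. The word frequency table, the inverse-log weighting, the fixed weight for negations, and the rule that a regular expression hit raises the score to a floor of 0.9 are all assumptions made for illustration; the source does not specify how the components are weighted or combined.

```python
import math
import re

# Hypothetical corpus frequencies (occurrences per million words).
WORD_FREQ = {"the": 60000, "is": 10000, "on": 7000, "not": 4000,
             "force": 120, "ball": 90, "zero": 60, "net": 40}
NEGATIONS = {"not", "no", "never"}


def tokens(text):
    """Lowercase word tokens; no syntactic analysis is attempted."""
    return re.findall(r"[a-z']+", text.lower())


def weight(word):
    """Rarer words and negations receive higher weight (assumed scheme)."""
    if word in NEGATIONS:
        return 1.0
    freq = WORD_FREQ.get(word, 0)       # unseen words count as very rare
    return 1.0 / math.log(freq + math.e)


def weighted_overlap(student, expectation):
    """Weighted share of the expectation's words found in the student text."""
    exp_words = set(tokens(expectation))
    stu_words = set(tokens(student))
    total = sum(weight(w) for w in exp_words)
    shared = sum(weight(w) for w in exp_words & stu_words)
    return shared / total if total else 0.0


def match_score(student, expectation, pattern=None, regex_floor=0.9):
    """Combine weighted overlap with an optional hand-written pattern."""
    score = weighted_overlap(student, expectation)
    if pattern and re.search(pattern, student, flags=re.IGNORECASE):
        score = max(score, regex_floor)  # assumed combination rule
    return score
```

Because the matcher works on bags of words and patterns, a telegraphic contribution such as "net zero" still earns more credit than a grammatical fragment built from common words such as "the is".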

We learned, after many years, that a semantic match algorithm with impressive 
fidelity will not necessarily go the distance in meeting the students’ wishes. There are 
two problems that continue to haunt us. The first problem addresses the students’ 
standards on what it means to cover a sentence-like answer correctly. If a good answer 
has four content words (A, B, C, D) that ideally are expressed, the students want full 
credit if they can express only one or two of the distinctive words (e.g., A and B). They 
get frustrated when their partial answers only get neutral or negative feedback from the 
tutor; students think they have covered the sentence answer but AutoTutor does not 
score it as covered unless the students express the remaining words (C and D). The 
students presume that shared knowledge should be sufficient to fill in the
remaining words, but AutoTutor wants to see a more complete answer articulated. The 
second problem addresses the semantic blur that invariably occurs between expectations
and misconceptions when the matching relies on statistical algorithms like LSA,
word overlap, and regular expressions. Students may get negative feedback
when their statements match a misconception more than an expectation; or 
positive feedback when they express something erroneous. This semantic blur 
produces inaccurate feedback which can end up confusing or frustrating the 
student. Although we do everything we can to engineer the content and 
threshold parameters, these errors still occasionally occur because of the vague- 
ness of language. One practical solution to this challenge is to have AutoTutor 
give neutral short feedback after these uncertain or borderline semantic matches 
so that the student is not misled or frustrated when the semantic matches are 
imperfect. Another approach is to provide more discriminating hints and 
prompts when there is a semantic blur between expectations and misconcep- 
tions. The hints and prompts would more cleanly differentiate a correct expec- 
tation versus a misconception. 
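
The first of these solutions, neutral short feedback after uncertain or borderline matches, can be sketched as a simple decision rule in Python. The function name short_feedback, the threshold of 0.7, and the margin of 0.15 are hypothetical values chosen for illustration; the source does not report the parameters that AutoTutor uses.

```python
def short_feedback(expectation_score, misconception_score,
                   threshold=0.7, margin=0.15):
    """Return 'positive', 'negative', or 'neutral' short feedback.

    The two scores are the best semantic match of the student's
    statements to any expectation and to any misconception.
    """
    best = max(expectation_score, misconception_score)
    if best < threshold:
        return "neutral"    # nothing matched clearly enough
    if abs(expectation_score - misconception_score) < margin:
        return "neutral"    # semantic blur: too close to call
    if expectation_score > misconception_score:
        return "positive"
    return "negative"
```

Under these assumed parameters, scores of 0.9 and 0.3 yield positive feedback, whereas scores of 0.8 and 0.75 fall inside the margin and yield neutral feedback. The same blur condition could also be used to trigger a more discriminating hint or prompt.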


Future AutoTutor Directions and Trialogues 


Many spinoffs from AutoTutor have been developed after its inception in 1997 and the 
publication of Graesser et al. (2001). Nye et al. (2014) reported that dozens of systems 
have evolved from AutoTutor in the Institute for Intelligent Systems at the University
of Memphis. These systems have covered many STEM topics, reading comprehen- 
sion, writing, and scientific reasoning, with names like DeepTutor, GuruTutor, 
GnuTutor, AutoMentor, iDRIVE, iSTART, Writing-Pal, and Operation ARIES (and
ARA). A recent system has integrated AutoTutor with ALEKS in mathematics, 
a system commercialized by McGraw-Hill that has helped middle school stu- 
dents in the Memphis area (Hu et al. 2012). The Memphis team has recently 
started developing AutoTutor for basic electronics and electricity in an 
ElectronxTutor that is funded by the Office of Naval Research. The suite of
AutoTutor applications is starting to cover a large curriculum landscape. 
Researchers at other universities, businesses, and organizations are increasingly 
licensing the AutoTutor Script Authoring Tool (Cai et al. 2015) to develop their 
own content and integrate it with our generic AutoTutor Conversation Engine 
(ACE). For example, Wolfe et al. (2015) used the AutoTutor authoring tools to 
develop a website on genetic risk factors for breast cancer, called BRCA, and 
reported learning gains above the existing web site on the same topic. 
Educational Testing Service is licensing ASAT for assessment on a variety of 
competencies (English Language Learning, science, mathematics) in the context 
of virtual worlds with agents (Zapata-Rivera et al. 2015). The Army Research 
Laboratory has incorporated AutoTutor in its open source Generalized 
Intelligent Framework for Tutoring (GIFT, Sottilare et al. 2013). AutoTutor is 
growing further as it migrates to new systems with new names and 
applications. 

In recent years we have developed trialogues, which involve the human 
interacting with two agents, typically a student agent and a tutor agent in three-party
conversations (Graesser et al. 2014; Graesser et al. 2015a, b; Millis et al.
2011). Two agents add considerable benefits theoretically because the two 
agents can model successful conversational interactions, such as asking good 
questions and receiving good answers (Gholson et al. 2009) or staging arguments
that create cognitive disequilibrium, productive confusion, and deeper
learning (D’Mello et al. 2014; Lehman et al. 2013). The trialogues can help 
rectify some of the problems previously discussed for AutoTutor dialogues. For
example, when the human’s answer is incomplete, the student agent can fill in 
the missing words and articulate a more complete answer; this not only models 
good answers but also circumvents any negative short feedback to the human. 

Graesser et al. (2015b) identified seven trialogue designs that can be used in learning
environments. The two agents in each design can take on different roles, but typically 
one is a tutor and the other a student peer. 


(1) Vicarious learning with human observer. Two agents interact and model ideal 
behavior, answers to questions, or reasoning. 

(2) Vicarious learning with limited human participation. The same as #1 except that 
the agents occasionally turn to the human and ask a prompt question, with a yes/ 
no or single-word answer. 

(3) Tutor agent interacting with human and student agent. There is a tutorial dialogue 
with the human, but the student agent periodically contributes and receives 
feedback.

(4) Expert agent staging a competition between the human and a peer agent. There is
a competitive game between the human and peer agent, with the expert agent 
organizing the event. 

(5) Human teaches/helps a student agent with facilitation from the tutor agent. As the 
human tries to help the peer agent, the tutor agent rescues a problematic situation. 

(6) Human interacts with two peer agents that vary in proficiency. The peer agents 
can vary in knowledge and skills. 

(7) Human interacts with two agents expressing contradictions, arguments, or different
views. The discrepancies between agents stimulate cognitive disagreement,
confusion, and potentially deeper learning.


Our current hypothesis is that these seven trialogue designs should be adaptively 
administered, depending on the student’s knowledge and other psychological attributes. 
The vicarious learning designs (1 and 2) are appropriate for learners with limited 
knowledge, skills, and actions, whereas designs 5 and 7 are suited to the more capable 
students attempting to achieve deeper knowledge. Design 4 is motivating for learners by 
virtue of the game competition. Research needs to be conducted to assess empirically the 
conditions under which different trialogue designs facilitate learning and motivation. 

Trialogues have been routinely incorporated in our recent AutoTutor applications. 
Scientific reasoning is the focus in an instructional game called Operation ARIES! 
(Millis et al. 2011), which was subsequently commercialized by Pearson Education as 
Operation ARA (Halpern et al. 2012). ARIES is an acronym for Acquiring Research 
Investigative and Evaluative Skills whereas ARA is an acronym for Acquiring 
Research Acumen. Agent trialogues are currently being developed in computer inter- 
ventions to train comprehension strategies for adults with reading difficulties in the 
Center for the Study of Adult Literacy (CSAL, http://csal.gsu.edu/content/homepage). 
Interestingly, some trialogue designs have always been used in McNamara’s iSTART 
trainer for reading comprehension (McNamara et al. 2006). ETS is currently using 
trialogues for assessment and is licensing our ASAT and ACE facilities for that purpose 
(Zapata-Rivera et al. 2015). 

It is of course possible to build systems with more than two agents and more than 
one human. One can imagine communities of humans and cyber agents interacting in 
varying numbers. The cyber agents will need conversation mechanisms that are 
adaptive and flexible in a similar vein to AutoTutor dialogues and trialogues. At that
point we enter the arenas of collaborative problem solving (Fiore et al. 2010; Graesser 
et al. 2015a) and computer supported collaborative learning (Dillenbourg 1999; Rosé
et al. 2008). These are two areas on our horizon during the next decade.